APRIORI APPROACH TO GRAPH-BASED CLUSTERING OF TEXT DOCUMENTS by Mahmud
Abstract
This thesis report introduces a new document clustering technique based on frequent senses. The developed system, named GDClust (Graph-Based Document Clustering) [1], works with frequent senses rather than the frequent keywords used in traditional text mining techniques. GDClust represents text documents as hierarchical document-graphs and uses an Apriori paradigm to find the frequent subgraphs, which reflect frequent senses. The discovered frequent subgraphs are then used to generate accurate sense-based document clusters. We propose a novel multilevel Gaussian minimum support strategy for candidate subgraph generation. Additionally, we introduce another novel mechanism, called Subgraph-Extension mining, that reduces the number of candidates and the overhead imposed by the traditional Apriori-based candidate generation mechanism. GDClust uses an English-language thesaurus (WordNet [2]) to construct document-graphs and exploits graph-based data mining techniques for sense discovery and clustering. It is an automated system and requires minimal human interaction for clustering.
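The Apriori paradigm mentioned in the abstract can be illustrated with a minimal sketch. It is not GDClust's implementation: it assumes each document-graph is simply a set of labeled edges, treats a "subgraph" as any edge set (no connectivity check), and uses a single fixed `min_support` rather than the multilevel Gaussian strategy, whose details the abstract does not give. The function name `frequent_edge_sets` and the toy edges are hypothetical.

```python
from itertools import combinations


def frequent_edge_sets(doc_graphs, min_support):
    """Level-wise (Apriori-style) mining of frequent edge sets.

    doc_graphs: list of frozensets of edges, one per document (assumed encoding).
    min_support: minimum number of document-graphs that must contain an edge set.
    Returns a dict mapping each frequent edge set to its support count.
    """
    def support(edge_set):
        # Number of document-graphs that contain every edge of edge_set.
        return sum(1 for g in doc_graphs if edge_set <= g)

    # Level 1: single edges that meet the support threshold.
    all_edges = {edge for g in doc_graphs for edge in g}
    current = {frozenset([e]) for e in all_edges
               if support(frozenset([e])) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        # Join step: merge two frequent k-sets that differ in one edge.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune step: every k-subset of a candidate must be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k)):
                candidates.add(union)
        current = {c for c in candidates if support(c) >= min_support}
        k += 1
    return result


# Toy document-graphs: each edge is a (concept, more-general concept) pair.
docs = [
    frozenset({("dog", "canine"), ("canine", "animal")}),
    frozenset({("dog", "canine"), ("canine", "animal"), ("cat", "feline")}),
    frozenset({("cat", "feline"), ("feline", "animal")}),
]
frequent = frequent_edge_sets(docs, min_support=2)
```

In this sketch, documents that share many frequent edge sets would be natural candidates for the same cluster; how GDClust actually measures that similarity is not stated in the abstract.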
Similar resources
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
Plagiarism Detection Considering Frequent Senses Using Graph Based Research Document Clustering
A new graph-based research document clustering technique (GRD-Clust) is introduced, based on frequent senses rather than the frequent keywords of traditional document clustering techniques. GRDClust represents text documents as hierarchical document-graphs and uses an Apriori paradigm to find the frequent subgraphs, which reflect frequent senses based on support and confidence. We highlight...
Clustering Web Documents based on Efficient Multi-Tire Hashing Algorithm for Mining Frequent Termsets
Document Clustering is one of the main themes in text mining. It refers to the process of grouping documents with similar contents or topics into clusters to improve both availability and reliability of text mining applications. Some of the recent algorithms address the problem of high dimensionality of the text by using frequent termsets for clustering. Although the drawbacks of the Apriori al...
Performance Evaluation of an Efficient Frequent Item sets-Based Text Clustering Approach
The amount of textual information available in electronic form has been growing at a staggering rate in recent times. Mining useful or interesting frequent itemsets (words/terms) from the very large text databases that result from this growth remains quite a challenging task. A great deal of attention in research community has been receiv...
An Optimal Approach to Local and Global Text Coherence Evaluation Combining Entity-based, Graph-based and Entropy-based Approaches
Text coherence evaluation becomes a vital and lovely task in Natural Language Processing subfields, such as text summarization, question answering, text generation and machine translation. Existing methods like entity-based and graph-based models are engaging with nouns and noun phrases change role in sequential sentences within short part of a text. They even have limitations in global coheren...
Publication date: 2008